Protein multiple sequence alignment benchmarking through secondary structure prediction
Abstract
Motivation: Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Comparing different methods has been problematic and has relied on gold-standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on a small subset of sequences of known structure and do not fairly test the quality of a full MSA.
Results: In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which uses secondary structure prediction accuracy (SSPA) to measure alignment quality. This rests on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA, however, measures the quality of an entire alignment, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size, but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow, accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of alignment options and by introducing different levels of misalignment into MSAs and examining the effects on the scores.
Availability and Implementation: QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz.
Contact: [email protected].
Supplementary information: Supplementary data are available at Bioinformatics online.
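The abstract does not spell out how SSPA is computed, so the following is only a minimal sketch under an assumption: that accuracy is the percentage of residues whose predicted three-state label (H, E, C) matches the label derived from the known structure, averaged over the reference sequences in a test case. The function names `sspa` and `mean_sspa` are hypothetical and are not taken from QuanTest.

```python
from typing import Iterable, Tuple


def sspa(predicted: str, reference: str) -> float:
    """Per-residue accuracy (in percent) between two secondary structure strings.

    Assumption: both strings use one label per residue (e.g. H, E, C) and
    cover the same residues, so they must have equal, non-zero length.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


def mean_sspa(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the per-sequence accuracy over (predicted, reference) pairs."""
    scores = [sspa(p, r) for p, r in pairs]
    if not scores:
        raise ValueError("at least one pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy strings, not real data: 8 of 10 positions agree.
    print(sspa("HHHHCCEEEE", "HHHCCCEEEC"))
```

Under this assumption, two aligners would be compared by running the same secondary structure predictor on each of their alignments and comparing the resulting mean scores.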
Related resources
An Algorithmic Framework for the Study of Behavior of siRNA Sequences
The study of biological sequences is gaining momentum. An increasing number of researchers have proposed frameworks for implementing various algorithms for biomolecular sequence alignment and secondary structure prediction. A comparative study can also enhance the results, but alignment and prediction algorithms vary widely in terms of both sensitivity and selectivity across ...
Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches
A DNA sequence, although it contains all genetic traits, is not a functional entity. Instead, it is converted to protein sequences by the processes of transcription and translation. The protein sequence later takes on a 3D structure, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can think of is undertaken by proteins with specific functio...
Extended Sequence Alignment Method for Protein Secondary Structure Prediction
Sequence alignment methods are very effective for secondary structure prediction. However, they are only applicable when the similarity of the sequences is high enough. We previously reported that the extended sequence alignment method, which uses not only amino acid letters but also strings of amino acid letters representing motifs as comparing units, enabled us to find common motifs even among t...
Computational methods for protein secondary structure prediction using multiple sequence alignments.
Efforts to use computers in predicting the secondary structure of proteins based only on primary structure information started over a quarter century ago [1-3]. Although the results were encouraging initially, the accuracy of the pioneering methods generally did not attain the level required for using predictions of secondary structures reliably in modelling the three-dimensional topology of pr...
Protein Secondary Structure Prediction Using RT-RICO: A Rule-Based Approach
Protein structure prediction has always been an important research area in biochemistry. In particular, the prediction of protein secondary structure has been a well-studied research topic. The experimental methods currently used to determine protein structure are accurate, yet costly both in terms of equipment and time. Despite the recent breakthrough of combining multiple sequence alignment i...